Session 1

Lauren Yee

March 31, 2020

Course Materials: www.mapdatascience.com/ggplot

About me

Wearer of many hats! Data Wrangling, Data Visualization, Modelling, Dashboards, Web Development and Research.

Prior to my consulting role, I studied emerging infectious diseases and spatial epidemiology.


Lauren@mapdatascience.com
@EcoLaurenY
Research Gate

Lauren Yee
Data Scientist

Objectives

  1. Use the ggplot2 library to visualize data
  2. Use tidyverse packages to analyze data
  3. Create reproducible PDF and Word Reports using RMarkdown
  4. Create interactive reports

What this course is:

What this course is not:

Data Visualization

Data Visualization Principles

What question needs to be answered?

Who is your intended audience?

What are you trying to show?

How can a visualization show this or make relationships more clear?

Bad Visualizations

Source: Economist https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368

The Good

John Snow, 1854

Source: Bill Rankin

Source: reddit.com/r/dataisbeautiful user:TrustLittleBrother

Data Visualization Principles

Size: The size of points is varied based on the values or ratios in the data.

Value/Contrast – how light or dark a particular colour is while hue is held constant.

Hue – the colour itself, used to distinguish between classes or categories (e.g. agricultural land vs urban land use). Hues can be combined with different textures or shapes if there are a large number of categories, or when images are printed in black and white.

Saturation – the intensity of a colour; a less saturated colour is the hue mixed with more grey.

Hue, Contrast, Saturation and Colour are important to understand in terms of accessibility and colourblindness.

Source: Goodchild and Rhind 2015
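
As a rough sketch of how size and hue map onto a plot in ggplot2, the example below uses the built-in mtcars dataset (an assumption for illustration, not course data). Point size follows a numeric variable and hue separates categories.

```r
library(ggplot2)

# mtcars ships with R; it stands in here for any dataset
aes_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # size encodes a numeric value, colour (hue) encodes a category
  geom_point(aes(size = hp, colour = factor(cyl)), alpha = 0.8) +
  labs(size = "Horsepower", colour = "Cylinders")

aes_plot
```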

Shape

Shape can be used to distinguish categories or to draw attention to a particular data point.

In ggplot2 we have the option of these shapes:
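
One way to preview the available point shapes yourself is to plot them. This is a minimal sketch: ggplot2 accepts the numeric shape codes 0 to 25, and the grid layout and colours below are arbitrary choices.

```r
library(ggplot2)

# One row per shape code, laid out on a simple grid
shapes <- data.frame(
  shape = 0:25,
  x = 0:25 %% 7,
  y = -(0:25 %/% 7)
)

shape_plot <- ggplot(shapes, aes(x, y)) +
  # fill only affects shapes 21 to 25, which have a separate outline and fill
  geom_point(aes(shape = shape), size = 5, fill = "steelblue") +
  geom_text(aes(label = shape), nudge_y = -0.35, size = 3) +
  scale_shape_identity() +  # use the numbers in the data directly as shape codes
  theme_void()

shape_plot
```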

Contrast

The New York Times

Colour

Sequential: great for values that run from low to high; colours go from light to dark.

Diverging: highlights differences from a midpoint, e.g. standard deviations from a mean.

Qualitative: used to present categorical data, e.g. soil types or types of neighbourhood.
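
The three palette types could be sketched in ggplot2 roughly as follows. The dataset (mtcars), the palette names and the colours are assumptions chosen for illustration; other palettes work the same way.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(wt, mpg))

# Sequential: light-to-dark ramp for a low-to-high numeric value
sequential <- base_plot +
  geom_point(aes(colour = hp)) +
  scale_colour_distiller(palette = "Blues", direction = 1)

# Diverging: two hues meeting at a midpoint (here, the mean mpg)
diverging <- base_plot +
  geom_point(aes(colour = mpg - mean(mpg))) +
  scale_colour_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0)

# Qualitative: distinct hues for unordered categories
qualitative <- base_plot +
  geom_point(aes(colour = factor(cyl))) +
  scale_colour_brewer(palette = "Set2")
```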

Colour Blindness

Source: Viridis Colour Scales (Green-Blind (Deuteranopia))
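
A minimal sketch of applying a viridis scale in ggplot2, assuming ggplot2 3.0.0 or later (where the viridis scales are built in). The faithfuld dataset ships with ggplot2 and is used here only as an example.

```r
library(ggplot2)

# faithfuld: 2D density estimate of Old Faithful eruptions, bundled with ggplot2
viridis_plot <- ggplot(faithfuld, aes(waiting, eruptions, fill = density)) +
  geom_raster() +
  scale_fill_viridis_c()  # continuous viridis scale, designed to be colourblind-friendly

viridis_plot
```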

Colour

Colour with no purpose takes away from the meaning of the data (the “rainbow effect”).

Choosing a visualization

With all the plotting libraries available in R - how do you choose?

There are some great flowcharts and websites to make the decision easier when stuck:

Tidyverse

A collection of R packages that make it easier to work with data.

Tidy Data

It is often said that 80% of data analysis is spent on cleaning and preparing data. The goal of tidyr is to help you create tidy data. Tidy data is data where:

  • Every column is a variable.
  • Every row is an observation.
  • Every cell is a single value.
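
To make the three rules above concrete, here is a small sketch using tidyr's pivot_longer() (assumes tidyr 1.0.0 or later). The sites, years and counts are made up for illustration.

```r
library(tidyr)

# Untidy: the year variable is spread across column names
wide <- data.frame(
  site = c("A", "B"),
  `2018` = c(10, 7),
  `2019` = c(12, 9),
  check.names = FALSE
)

# Tidy: one column per variable (site, year, cases), one row per observation
tidy <- pivot_longer(
  wide,
  cols = c("2018", "2019"),
  names_to = "year",
  values_to = "cases"
)

tidy
```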

“data tidying: structuring datasets to facilitate analysis.”

Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. It is an attempt to standardize data.

All tidyverse packages are designed to work with tidy data.

Setting up your environment and projects

R uses a “working directory”, which is where R first looks for files you ask it to load and where it saves files by default. A recommended structure for R projects is as follows:

A great tool in R is to “Create a New Project”, which will then be mapped to your “Workspace”. This is similar to using setwd() in R; however, each project's working directory defaults to the folder the project is saved in.
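
A quick way to check where R is looking, and to build paths relative to that location, is sketched below. The folder and file names are hypothetical placeholders, not part of the course materials.

```r
# Where is R currently looking for files?
getwd()

# Build paths relative to the working directory instead of hard-coding full paths.
# "data" and "cases.csv" are hypothetical names used only for illustration.
data_path <- file.path("data", "cases.csv")

# TRUE only if that file exists under the current working directory
file.exists(data_path)
```

Inside a project, relative paths like this keep working when the project folder is moved or shared.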

See also: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects

R Markdown

https://rmarkdown.rstudio.com/

Designed to be used in three ways:

  1. Communication to decision makers
    • High-level conclusions and visualizations
  2. Collaboration with teams
    • Including code, methods and approach
  3. Environment to do data analysis
    • A modern-day lab notebook including what you did, your code, and why you did it that way

All R Markdown documents end in .Rmd as opposed to .R. An R Markdown file starts with a YAML header, where you can specify the title and other metadata for your file, as well as the output format to generate, such as a PDF, Word file, or HTML document.
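
As an illustration, the sketch below writes a minimal .Rmd file from R so you can see the pieces: a YAML header between the --- lines, then one chunk. The title and file name are placeholders; swap html_document for pdf_document or word_document to change the output.

```r
# Build the chunk fence programmatically (three backticks)
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                           # start of the YAML header
  'title: "Session 1 example"',    # placeholder title
  "output: html_document",         # or pdf_document / word_document
  "---",                           # end of the YAML header
  "",
  paste0(fence, "{r}"),            # start of an R chunk
  "summary(cars)",
  fence                            # end of the chunk
)

# Written to a temporary folder so nothing in your project is touched
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# To knit it (assumes the rmarkdown package and pandoc are installed):
# rmarkdown::render(rmd_file)
```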

Each “chunk” of R code, delimited by ```, can be executed independently, and visualizations are generated in-line.

Chunk Options

eval = FALSE prevents code from being evaluated. This is useful for displaying example code, or for disabling a large block of code without commenting each line.

include = FALSE runs the code, but doesn’t show the code or results in the final document.

echo = FALSE prevents code, but not the results, from appearing in the finished file. Use this when writing reports aimed at people who don’t want to see the underlying R code, or to show only a figure generated by ggplot2.
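
These options can be set per chunk in the chunk header, or once for the whole document. A minimal sketch of the document-wide approach, assuming the knitr package is installed, is to run the following in a setup chunk near the top of the file:

```r
library(knitr)

# Make echo = FALSE the default for every chunk that follows;
# individual chunks can still override it in their own header.
opts_chunk$set(echo = FALSE)

# Check the current default
opts_chunk$get("echo")
```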

Change Theme

In R